Pandas is probably (together with scipy, numpy, and sklearn) the main reason that Python has become popular for data science. According to ‘Learn Data Sci’ it accounts for 1% of all Stack Overflow question views!
You will want to bookmark these:
Here’s code to read a (remote) CSV file:
import pandas as pd # import package
# Bitly: https://raw.githubusercontent.com/jreades/fsds/master/data/2019-sample-crime.csv
url='https://bit.ly/39SJpfp'
df = pd.read_csv(url) # load a (remote) CSV
print(type(df)) # not simple data type
print(df.columns.values) # column names
Output:
df.info() is more about data types and memory usage; df.describe() summarises the distribution of values in each series.
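For instance (a sketch using a small in-memory frame rather than the remote crime CSV, so the values here are purely illustrative):

```python
import pandas as pd

# A small stand-in frame (hypothetical values, not the crime data)
toy = pd.DataFrame({'Primary Type': ['THEFT', 'ASSAULT', 'THEFT'],
                    'Latitude': [41.88, 41.76, 41.90]})

toy.info()                            # dtypes, non-null counts, memory usage
print(toy.describe())                 # distribution stats for numeric columns
print(toy.describe(include='object')) # count/unique/top/freq for text columns
```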
This should be looking eerily familiar:
print(type(df['Latitude'])) # type for column
print(type(df['Latitude'].array)) # type for values
print(df['Latitude'].array[:5]) # first five values
print(f"1: {df['Latitude'].mean()}") # summarise a series/column
print(f"2: {df.Latitude.mean()}") # if no spaces in name
Produces:
Pandas is also ‘Jupyter-aware’, meaning that output can be displayed directly in Jupyter in ‘fancy’ ways:
Pandas operates on two principles:
More subtly, operations on a Series or Data Frame return a shallow copy, which is like a ‘view’ in a database…
You can modify a data frame directly with inplace=True (where supported); attempting to modify a view instead triggers a SettingWithCopyWarning. If you need a full copy then use the copy() method (e.g. df.copy() or df['Latitude'].copy()).
# Returns a series but not a column
df.Latitude - 1
# Saves returned series as a new column
df['lat'] = df.Latitude - 1
# Returns a new data frame w/o 'lat'
df.drop(columns=['lat'])
# Modifies df directly
df.drop(columns=['lat'], inplace=True)
# Try to modify a view of df (triggers warning)
df[df['Primary Type']=='BURGLARY'].Latitude = 41.7
Operations on a Data Frame return a DataFrame and operations on a Series return a Series, allowing us to ‘chain’ steps together:
# Returns a selection (Boolean series)
df['Primary Type']=='ASSAULT'
# All rows where Primary Type is ASSAULT
df[ df['Primary Type']=='ASSAULT' ]
# Calculations on a slice (returns mean centroid!)
df[df['Primary Type']=='ASSAULT'][['Longitude','Latitude']].mean()
# Two conditions with a bit-wise AND
df[
(df['Primary Type']=='ASSAULT') &
(df['Description']=='AGGRAVATED: HANDGUN')
]
# Two conditions with a bit-wise OR
df[
(df['Primary Type']=='ASSAULT') |
(df['Primary Type']=='THEFT')
]
A Data Series can only be of one type:
| Pandas Dtype | Python Type | Usage |
|---|---|---|
| object | str or mixed | Text or mixed columns (including arrays) |
| int64 | int | Integer columns |
| float64 | float | Floating point columns |
| bool | bool | True/False columns |
| datetime64 | N/A (datetime) | Date and time columns |
| timedelta[ns] | N/A (datetime) | Datetime difference columns |
| category | N/A (set) | Categorical columns |
print(df['Primary Type'].unique()) # Find unique values
print(df['Primary Type'].dtype.name) # Confirm is 'object'
df['Primary Type'] = df['Primary Type'].astype('category')
print(df['Primary Type'].dtype.name) # Confirm is 'category'
print(df['Primary Type'].describe()) # Category column info
Outputs:
What do we do here?
print(df.Date.dtype.name)
# object
df.Date.to_list()[:3]
# ['04/20/2019 11:00:00 PM', '12/02/2019 10:35:00 AM', '10/06/2019 04:50:00 PM']
This shows that Date is currently stored as strings of dates+times.
Pandas handles date and times using a datetime type that also works as an index (more on these later):
df['dt'] = pd.to_datetime(df.Date.values,
                          format="%m/%d/%Y %I:%M:%S %p")
print(df.dt.dtype.name)
# datetime64[ns]
df.dt.to_list()[:3]
# [Timestamp('2019-04-20 23:00:00'), Timestamp('2019-12-02 10:35:00'), Timestamp('2019-10-06 16:50:00')]
Note that %I (the 12-hour clock) is needed here because the source strings carry an AM/PM marker (%p); with %H the marker would be ignored and ‘11:00:00 PM’ would be mis-read as 11:00. These codes follow the formatting conventions of strftime (string format time) for conversion.
Examples of strftime conventions include:
| Format | Meaning |
|---|---|
| %d | 2-digit day |
| %m | 2-digit month |
| %y | 2-digit year |
| %Y | 4-digit year |
| %p | AM/PM |
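The same codes work with Python’s own datetime module, so a quick sketch using the crime data’s date format:

```python
from datetime import datetime

# Parse a string in the crime data's format (12-hour clock, so %I pairs with %p)
ts = datetime.strptime('04/20/2019 11:00:00 PM', '%m/%d/%Y %I:%M:%S %p')
print(ts)                       # 2019-04-20 23:00:00

# And format back out again with strftime
print(ts.strftime('%d/%m/%y'))  # 20/04/19
```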
So that is why:
Note the other things happening here:
pd.to_datetime(...) is not a method, it’s a function from the pandas package.
df.Date.array (and df.Date.to_numpy() and df.Date.to_list()) gives access to the data directly, whereas df.Date gives access to the Series.
From time to time, real-world software projects will change the way things work. Pandas is just such a project!
Warning
We recommend using Series.array or Series.to_numpy(), depending on whether you need a reference to the underlying data or a NumPy array. See API Documentation.
So while Series.values still works, and will continue to work for some time, you are being advised to start using Series.array or Series.to_numpy() instead. Meaning, we should consider using df.Date.array.
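A small sketch of the difference on a throwaway Series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

print(type(s.array))       # a pandas extension array wrapping the data
print(type(s.to_numpy()))  # a plain numpy.ndarray copy/view of the values
```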
This is one way; there are many options and subtleties…
# Fix categories
mapping = {}
# df['Primary Type'].unique().to_list() also works
for x in df['Primary Type'].cat.categories.to_list():
mapping[x]=x.title()
# And update
df['Primary Type'] = df['Primary Type'].cat.rename_categories(mapping)
How would you work out what this code does?
To deal with pricing information treated as a string:
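The pricing example itself doesn’t appear in these notes, but a minimal sketch (assuming hypothetical price strings like '$1,250.00') could be:

```python
import pandas as pd

# Hypothetical price data stored as strings
prices = pd.Series(['$1,250.00', '$980.50', '$15,000.00'])

# Strip the currency symbol and thousands separators, then convert to float
as_floats = prices.str.replace('[$,]', '', regex=True).astype(float)
print(as_floats.to_list())  # [1250.0, 980.5, 15000.0]
```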
Many more examples accessible via Google!
There are multiple ways to drop ‘stuff’:
df2 = df.copy()
print(f"The data frame has {df2.shape[0]:,} rows and {df2.shape[1]:,} cols.")
df2.drop(index=range(5,10), inplace=True) # Row 'numbers' or index values
print(f"The data frame has {df2.shape[0]:,} rows and {df2.shape[1]:,} cols.")
df2.drop(columns=['Year'], inplace=True) # Column name(s)
print(f"The data frame has {df2.shape[0]:,} rows and {df2.shape[1]:,} cols.")
There is also df.dropna() which can apply to rows or columns with NULL or np.nan values.
I often prefer df = df[df.index >= 15] (negative selection) to df.drop(index=range(0,15)) (positive selection).
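As a sanity check, a sketch on a toy frame showing that the two selections end up equivalent:

```python
import pandas as pd

toy = pd.DataFrame({'x': range(20)})

kept    = toy[toy.index >= 15]         # negative selection: keep what you want
dropped = toy.drop(index=range(0,15))  # positive selection: name what to drop

print(kept.equals(dropped))  # True
```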
| Index | 0: ID | 1: Case Number | 2: Date | 3: Primary Type |
|---|---|---|---|---|
| 0 | 11667185 | JC237601 | 04/20/2019 11:00:00 PM | BURGLARY |
| 1 | 11909178 | JC532226 | 12/02/2019 10:35:00 AM | DECEPTIVE PRACTICE |
| 2 | 11852571 | JC462365 | 10/06/2019 04:50:00 PM | BATTERY |
We can interact with rows and columns by position or name:
df.iloc[0:2,0:2] # List selection! (':' means 'all')
df.loc[0:2,['ID','Case Number']] # Dict selection
These actually return different results because of the index:
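A minimal sketch with a throwaway frame makes the difference concrete:

```python
import pandas as pd

toy = pd.DataFrame({'a': [10, 20, 30, 40]})

print(len(toy.loc[0:2]))   # 3: label slicing includes both endpoints
print(len(toy.iloc[0:2]))  # 2: integer slicing excludes the stop position
```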
df.loc returns the rows labeled 0, 1, and 2 ([0..2]), whereas df.iloc returns the range 0..2 ([0..2))! So, by default, pandas creates a row index whose values are 0..n and a column index whose values are the column names. You will see this if you print out the head:
The left-most column (without a name) is the index.
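The change shown below comes from promoting the ID column to be the row index; the call itself doesn’t survive in these notes, but it presumably looks like this sketch:

```python
import pandas as pd

# A sketch frame; the notes' own data frame would use
# df.set_index('ID', inplace=True) on the crime data loaded earlier
sample = pd.DataFrame({'ID': [11667185, 11909178],
                       'Case Number': ['JC237601', 'JC532226']})
sample.set_index('ID', inplace=True)
print(sample.index.name)  # ID
```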
Now we see:
Case Number Date ... Longitude dt
ID ...
11667185 JC237601 04/20/2019 11:00:00 PM ... -87.603468 2019-04-20 11:00:00
11909178 JC532226 12/02/2019 10:35:00 AM ... -87.643230 2019-12-02 10:35:00
11852571 JC462365 10/06/2019 04:50:00 PM ... -87.758473 2019-10-06 04:50:00
So ID is now the index and is not accessible as a column: df.ID will now throw an error because it’s no longer part of the Column Index.
Notice the change to the data frame:
| ID | 0: Case Number | 1: Date | 2: Primary Type |
|---|---|---|---|
| 11667185 | JC237601 | 04/20/2019 11:00:00 PM | BURGLARY |
| 11909178 | JC532226 | 12/02/2019 10:35:00 AM | DECEPTIVE PRACTICE |
| 11852571 | JC462365 | 10/06/2019 04:50:00 PM | BATTERY |
And now:
Mnemonic: we use iloc to select rows/cols based on integer location and we use loc to select rows/cols based on label (i.e. name).
P.S. You can reset the data frame using df.reset_index(inplace=True).
Pandas can write to a wide range of file types, here are some of the more popular ones:
| Command | Saved As… |
|---|---|
| df.to_csv(<path>) | CSV file. But note the options to change sep (default is ',') and to suppress index output (index=False). |
| df.to_excel(<path>) | XLSX file. But note the options to specify a sheet_name, na_rep, and so on, as well as to suppress the index (index=False). |
| df.to_feather(<path>) | Directly usable by R. Requires pyarrow to be installed to access the options. |
| df.to_parquet(<path>) | Directly usable by many languages. Requires pyarrow to be installed to access the options. |
| df.to_latex(<path>) | Write a LaTeX-formatted table to a file. Display requires booktabs. Could do copy+paste with print(df.to_latex()). |
| df.to_markdown(<path>) | Write a Markdown-formatted table to a file. Requires tabulate. Could do copy+paste with print(df.to_markdown()). |
In most cases compression is detected automatically (e.g. df.to_csv('file.csv.gz')) but you can also specify it (e.g. df.to_csv('file.csv.gz', compression='gzip')).
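A quick round-trip sketch (writing a toy frame to a temporary directory) showing that inference at work:

```python
import os
import tempfile
import pandas as pd

tiny = pd.DataFrame({'x': [1, 2, 3]})

# The .gz extension triggers gzip compression on write and decompression on read
path = os.path.join(tempfile.mkdtemp(), 'file.csv.gz')
tiny.to_csv(path, index=False)
back = pd.read_csv(path)
print(tiny.equals(back))  # True
```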
Pandas • Jon Reades